Data wrangling exercises are done in the corresponding R script.
First, the Boston dataset is loaded from the MASS package:
library(MASS)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
##
## select
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
data(Boston)
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
The data has 506 observations of 14 variables, all of which are either numeric or integer. The variables include, for example, per capita crime rate by town (crim), nitrogen oxides concentration (nox), index of accessibility to radial highways (rad), pupil-teacher ratio by town (ptratio) and average number of rooms per dwelling (rm).
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
rad (index of accessibility to radial highways) ranges from 1 to 24. lstat (percentage of the population of lower status) ranges from 1.73% to 37.97% with a mean of 12.65%, so its maximum lies far above the bulk of the values. crim is strongly right-skewed and has a noticeable outlier: max crim = 88.98, while the median is only 0.26 and the mean is 3.61. black (1000(Bk - 0.63)^2, where Bk is the proportion of blacks by town) has outliers at the low end: min black = 0.32, while the first quartile is already 375.38.
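The outlier remarks above can be checked with the common 1.5 * IQR rule. The following is only a sketch and not part of the original script; count_outliers is a hypothetical helper name, and the rule itself is an assumed definition of "outlier".

```r
library(MASS)  # provides the Boston data frame

# Hypothetical helper: number of values outside the 1.5 * IQR boxplot whiskers
count_outliers <- function(x) {
  length(boxplot.stats(x)$out)
}

# Count flagged values for the three variables discussed above
outlier_counts <- sapply(Boston[, c("crim", "black", "lstat")], count_outliers)
print(outlier_counts)
```

A boxplot of each variable, e.g. boxplot(Boston$crim), shows the same flagged points graphically.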
pairs(Boston)
There is a clear, roughly hyperbolic relationship between nox and dis (weighted mean of distances to five Boston employment centres): the higher the concentration of nitrogen oxides, the smaller the weighted mean distance to the employment centres. In other words, nitrogen oxides concentration is negatively associated with distance from the employment centres.
There is also a roughly hyperbolic relationship between lstat and medv (median value of owner-occupied homes in $1000s): the larger the share of lower-status population, the lower the median home value in the area, which is what one would expect.
There is an almost linear positive relationship between rm (average number of rooms per dwelling) and medv: the more rooms per dwelling, the higher the median value of the homes, and vice versa.
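The visual impressions from the scatterplot matrix can be backed up with correlation coefficients. This is a minimal sketch, not part of the original analysis; the choice of variable pairs and the use of Spearman's rank correlation for the curved relationships are assumptions.

```r
library(MASS)

# Pearson measures linear association; Spearman is rank-based,
# so it also suits curved but monotone patterns
pairs_to_check <- list(c("nox", "dis"), c("lstat", "medv"), c("rm", "medv"))

cors <- t(sapply(pairs_to_check, function(p) {
  c(pearson  = cor(Boston[[p[1]]], Boston[[p[2]]]),
    spearman = cor(Boston[[p[1]]], Boston[[p[2]]], method = "spearman"))
}))
rownames(cors) <- sapply(pairs_to_check, paste, collapse = "~")
print(round(cors, 2))
```

A negative sign indicates that one variable tends to decrease as the other increases.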
Now the dataset is standardized, and the continuous crim variable is replaced by a new categorical variable crime built from its quartiles:
boston_scaled <- scale(Boston)
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
boston_scaled <- as.data.frame(boston_scaled)
summary(boston_scaled$crim)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.419367 -0.410563 -0.390280 0.000000 0.007389 9.924110
# create a categorical crime variable from the quartiles of the scaled crime rate
breakpoints <- quantile(boston_scaled$crim)
labels <- c('low', 'med_low', 'med_high', 'high')
crime <- cut(boston_scaled$crim, breaks = breakpoints, include.lowest = TRUE, labels = labels)
#dropping and adding variables
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
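As a sanity check on the two steps above, the sketch below reproduces the standardization of crim by hand and tabulates the resulting classes. It is an illustration only; the names crim_manual and crim_scaled are not from the original script.

```r
library(MASS)

# scale() subtracts the column mean and divides by the column standard deviation
crim_manual <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
crim_scaled <- as.data.frame(scale(Boston))$crim
print(all.equal(crim_manual, crim_scaled))

# Quartile breakpoints split the observations into four classes of nearly equal size
crime <- cut(crim_scaled,
             breaks = quantile(crim_scaled),
             include.lowest = TRUE,
             labels = c("low", "med_low", "med_high", "high"))
print(table(crime))
```

Without include.lowest = TRUE, the observation with the minimum crime rate would fall outside the first interval and become NA.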
Dividing the data into training and test sets:
n <- nrow(boston_scaled)
random_rows <- sample(n, size = n * 0.8)
train_data <- boston_scaled[random_rows,]
test_data <- boston_scaled[-random_rows,]
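Because sample() draws different rows on every run, the results further down change between runs. One possible way to make the split reproducible is to fix the random seed first. This is a sketch on the raw Boston data; the seed value 123 is an arbitrary assumption.

```r
library(MASS)

set.seed(123)  # arbitrary seed (assumption); any fixed value makes the split repeatable

n <- nrow(Boston)
train_idx <- sample(n, size = floor(0.8 * n))  # 404 of the 506 rows

train <- Boston[train_idx, ]   # 80% of the rows for fitting
test  <- Boston[-train_idx, ]  # the remaining 20% for evaluation

print(c(train = nrow(train), test = nrow(test)))
```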
Fitting and plotting the LDA:
lda.fit <- lda(crime ~ ., data = train_data)
classes <- as.numeric(train_data$crime)
plot(lda.fit, dimen = 2,col=classes,pch=classes)
Predicting the classes with the LDA model:
correct_classes <- test_data[,"crime"]
test_data <- dplyr::select(test_data, -crime)
lda.pred <- predict(lda.fit, newdata = test_data)
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 14 5 2 0
## med_low 10 17 6 0
## med_high 1 9 11 1
## high 0 0 0 26
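The high class is predicted best: all 26 high observations are classified correctly, and only one med_high observation is wrongly predicted as high. Of the 21 low observations, 14 are predicted correctly, 5 are predicted as med_low and 2 as med_high. Of the 33 med_low observations, 17 are predicted correctly, 10 are predicted as low and 6 as med_high. The med_high class is the hardest: only 11 of its 22 observations are predicted correctly, and 9 are predicted as med_low. Overall, 68 of the 102 test observations (about 67%) are classified correctly. Because the train/test split is random, these counts change from run to run.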
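The cross-tabulation can be summarized numerically. The sketch below re-enters the table printed above by hand (rows are the correct classes, columns the predicted ones) and computes the overall accuracy and the per-class share of correct predictions; the object names are illustrative only.

```r
# Confusion matrix copied from the printed table (rows: correct, columns: predicted)
lvls <- c("low", "med_low", "med_high", "high")
conf_mat <- matrix(c(14,  5,  2,  0,
                     10, 17,  6,  0,
                      1,  9, 11,  1,
                      0,  0,  0, 26),
                   nrow = 4, byrow = TRUE,
                   dimnames = list(correct = lvls, predicted = lvls))

accuracy <- sum(diag(conf_mat)) / sum(conf_mat)  # overall share of correct predictions
recall   <- diag(conf_mat) / rowSums(conf_mat)   # share correct within each true class

print(round(accuracy, 3))
print(round(recall, 3))
```

In the script itself the same calculation can be applied directly to the object returned by table().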
Calculating the distances and visualizing the clusters:
library(MASS)
data('Boston')
boston_scaled_again <- scale(Boston)
dist_eu <- dist(boston_scaled_again)
# cluster the standardized data so that no single variable dominates the distances
km <- kmeans(boston_scaled_again, centers = 2)
pairs(Boston, col = km$cluster)
Two clusters appear to be the best choice. Three clusters also look reasonable, but one of them (the black one) does not seem to add much, and more than three clusters is excessive. For rad, tax and ptratio, the red cluster clearly picks out the outlying values. The clusters overlap in the lstat versus medv plot: one corresponds to a larger share of lower-status population and lower median home values, and the other to a smaller share of lower-status population and higher median home values (basically a poor versus rich dichotomy).
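The choice of the number of clusters can also be examined with the "elbow" heuristic: plot the total within-cluster sum of squares against k and look for the point where the curve stops dropping sharply. This is a sketch on the standardized data, not the original code; the seed, k_max = 10 and nstart = 10 are assumptions.

```r
library(MASS)

set.seed(123)  # k-means starts from random centers; the seed value is arbitrary
boston_std <- scale(Boston)

k_max <- 10
# Total within-cluster sum of squares for k = 1, ..., k_max
twcss <- sapply(1:k_max, function(k) {
  kmeans(boston_std, centers = k, nstart = 10)$tot.withinss
})

plot(1:k_max, twcss, type = "b",
     xlab = "number of clusters k",
     ylab = "total within-cluster sum of squares")
```

A sharp drop between k = 1 and k = 2 followed by a flatter curve would support the choice of two clusters.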
Bonus Task:
model_predictors <- dplyr::select(train_data, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
install.packages("plotly")
## Installing package into '/home/aleksandrako/R/x86_64-pc-linux-gnu-library/3.6'
## (as 'lib' is unspecified)
library("plotly")
## Loading required package: ggplot2
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
#colors from the crime classes
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3,
        type = 'scatter3d', mode = 'markers', color = train_data$crime)
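The matrix product used for the 3D plot is closely related to the scores returned by predict(). To my understanding, predict() on an lda fit centers the predictors before projecting them, so the two sets of coordinates should differ only by a constant shift per discriminant. The sketch below checks this on a fresh fit; the seed and all object names are assumptions, not part of the original script.

```r
library(MASS)

# Rebuild the categorical crime variable on the standardized data
boston_std <- as.data.frame(scale(Boston))
boston_std$crime <- cut(boston_std$crim, breaks = quantile(boston_std$crim),
                        include.lowest = TRUE,
                        labels = c("low", "med_low", "med_high", "high"))
boston_std$crim <- NULL

set.seed(123)  # arbitrary seed for the training subset
train <- boston_std[sample(nrow(boston_std), floor(0.8 * nrow(boston_std))), ]
fit <- lda(crime ~ ., data = train)

predictors <- as.matrix(train[, colnames(train) != "crime"])
manual_scores  <- predictors %*% fit$scaling  # uncentered projection, as in the bonus task
predict_scores <- predict(fit)$x              # projection after centering the predictors

# Minimum and maximum of the difference per discriminant: equal values mean a constant shift
shift <- manual_scores - predict_scores
print(apply(shift, 2, range))
```

A constant shift moves the whole point cloud without changing its shape, so the 3D plot looks the same with either set of coordinates.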